TEXT SEGMENTATION 
BASED ON SIMILARITY BETWEEN WORDS 



Hideki Kozima 

Course in Computer Science and Information Mathematics, 
Graduate School, University of Electro-Communications 
1-5-1, Chofugaoka, Chofu, 
Tokyo 182, Japan 
(xkozimaOphaeton. cs .uec .ac.jp) 



m 
(N 



> 

o 
o 

I 

o 

I 



X 



Abstract 

This paper proposes a new indicator of text struc- 
ture, called the lexical cohesion profile (LCP), 
which locates segment boundaries in a text. A 
text segment is a coherent scene; the words in 
a segment are linked together via lexical cohesion 
relations. LCP records mutual similarity of words 
in a sequence of text. The similarity of words, 
which represents their cohesiveness, is computed 
using a semantic network. Comparison with the 
text segments marked by a number of subjects 
shows that LCP closely correlates with the hu- 
man judgments. LCP may provide valuable in- 
formation for resolving anaphora and ellipsis. 

INTRODUCTION 

A text is not just a sequence of words, but it has 
coherent structure. The meaning of each word 
can not be determined until it is placed in the 
structure of the text. Recognizing the structure 
of text is an essential task in text understanding, 
especially in resolving anaphora and ellipsis. 

One of the constituents of the text struc- 
ture is a text segment. A text segment, whether 
or not it is explicitly marked, as are sentences and 
paragraphs, is defined as a sequence of clauses or 
sentences that display local coherence. It resem- 
bles a scene in a movie, which describes the same 
objects in the same situation. 

This paper proposes an indicator, called the 
lexical cohesion profile (LCP), which locates seg- 
ment boundaries in a narrative text. LCP is a 
record of lexical cohesiveness of words in a se- 
quence of text. Lexical cohesiveness is defined 
as word similarity (Kozima and Furugori, 1993) 
computed by spreading activation on a semantic 
network. Hills and valleys of LCP closely corre- 
late with changing of segments. 



SEGMENTS AND COHERENCE 

Several methods to capture segment boundaries 
have been proposed in the studies of text struc- 
ture. For example, cue phrases play an impor- 
tant role in signaling segment changes. (Grosz 
and Sidner, 1986) However, such clues are not di- 
rectly based on coherence which forms the clauses 
or sentences into a segment. 

Youmans (1991) proposed VMP (vocabu- 
lary management profile) as an indicator of seg- 
ment boundaries. VMP is a record of the number 
of new vocabulary terms introduced in an inter- 
val of text. However, VMP does not work well on 
a high-density text. The reason is that coherence 
of a segment should be determined not only by 
reiteration of words but also by lexical cohesion. 

Morris and Hirst (1991) used Roget's the- 
saurus to determine whether or not two words 
have lexical cohesion. Their method can cap- 
ture almost all the types of lexical cohesion, e.g. 
systematic and non-systematic semantic relation. 
However it does not deal with strength of cohe- 
siveness which suggests the degree of contribution 
to coherence of the segment. 

Computing Lexical Cohesion 

Kozima and Furugori (1993) defined lexical co- 
hesiveness as semantic similarity between words, 
and proposed a method for measuring it. Sim- 
ilarity between words is computed by spreading 
activation on a semantic network which is system- 
atically constructed from an English dictionary 
(LDOCE). 

The similarity cr(w,w') G [0,1] between 
words w,w' is computed in the following way: 
(1) produce an activated pattern by activating 
the node w; (2) observe activity of the node w' 
in the activated pattern. The following examples 
suggest the feature of the similarity a: 

a (cat, pet) = 0.133722 , 
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Figure 1. An activated pattern of a word list 
(produced from {red, alcoholic, drink}). 



a (cat, hat) = 0.001784 , 

u (waiter, restaurant) = 0.175699 , 
a (painter, restaurant) = 0.006260 . 

The similarity a depends on the significance 
s{w) e [0, 1], i.e. normalized information of the 
word w in West's corpus (1953). For example: 

s(red) = 0.500955 , s(and) = 0.254294 . 

The following examples show the relationship be- 
tween the word significance and the similarity: 

a (waiter, waiter) = 0.596803 , 
a (red, blood) = 0.111443 , 

a (of, blood) = 0.001041 . 

LEXICAL COHESION PROFILE 

LCP of the text T = {(I'l,- ■ •, wm} is a sequence 
{c(S'i),- • •, c{Sn)} of lexical cohesiveness c{Si). Si 
is the word list which can be seen through a fixed- 
width window centered on the i-th word of T: 

Si^{wi,Wi + i, ■ ■ • , Wi, Wi+i, • • ■,Wr-l,Wr}, 

l=i-A (if i< A, then / = 1), 
r=i+A {ii i>N-A, then r = N). 

LCP treats the text T as a word list without any 
punctuation or paragraph boundaries. 

Cohesiveness of a Word List 

Lexical cohesiveness c(5j) of the word list Si is 
defined as follows: 

<Si) = j:n,es,siw)-aiP{Si),w) , 

where a{P{Si),w) is the activity value of the 
node w in the activated pattern P{Si). P{Si) 
is produced by activating each node w € Si with 
strength s(w)^/^s(w). Figure 1 shows a sam- 
ple pattern of {red, alcoholic, drink}. (Note 
that it has highly activated nodes like bottle and 
wine.) 
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Figure 2. Correlation between LCP 
and text segments. 

0.6-1 

LCP 




0.3 H 1 1 1 1 1 

100 200 300 400 500 

i (words) 
Figure 3. An example of LCP 
(using rectangular window of A=25) 



The definition of c{Si) above expresses that 

c{Si) represents semantic homogeneity of Si, 
since P{Si) represents the average meaning of 
wG5,. For example: 

c( "Molly saw a cat. It was her family 
pet. She wished to keep a lion." 
= 0.403239 (cohesive), 

c( "There is no one but me. Put on 
your clothes . I can not walk more . " 
= 0.235462 (not cohesive). 

LCP and Its Feature 

A graph of LCP, which plots c{Si) at the text 
position i, indicates changing of segments: 

• If Si is inside a segment, it tends to be co- 
hesive and makes c{Si) high. 

• If Si is crossing a segment boundary, it tends 
to semantically vary and makes c{Si) low. 

As shown in Figure 2, the segment boundaries 
can be detected by the valleys (minimum points) 
of LCP. 

The LCP, shown in Figure 3, has large hills 
and valleys, and also meaningless noise. The 
graph is so complicated that one can not easily 



determine which valley should be considered as a 
segment boundary. 

The shape of the window, which defines 
weight of words in it for pattern production, 
makes LCP smooth. Experiments on several win- 
dow shapes (e.g. triangle window, etc.) shows 
that Banning window is best for clarifying the 
macroscopic features of LCP. 

The width of the window also has effect on 
the macroscopic features of LCP, especially on 
separability of segments. Experiments on several 
window widths ( A = 5 ~ 60) reveals that the Han- 
ning window of A = 25 gives the best correlation 
between LCP and segments. 

VERIFICATION OF LCP 

This section inspects the correlation between 
LCP and segment boundaries perceived by the 
human judgments. The curve of Figure 4 shows 
the LCP of the simplified version of O.Henry's 
"Springtime a la Carte" (Thornley, 1960). The 
solid bars represent the histogram of segment 
boundaries reported by 16 subjects who read the 
text without paragraph structure. 

It is clear that the valleys of the LCP cor- 
respond mostly to the dominant segment bound- 
aries. For example, the clear valley at i = 110 
exactly corresponds to the dominant segment 
boundary (and also to the paragraph boundary 
shown as a dotted line). 

Note that LCP can detect segment changing 
of a text regardless of its paragraph structure. 
For example, i = 156 is a paragraph boundary, 
but neither a valley of the LCP nor a segment 
boundary; i = 236 is both a segment boundary 
and approximately a valley of the LCP, but not 
a paragraph boundary. 

However, some valleys of the LCP do not 
exactly correspond to segment boundaries. For 
example, the valley near i = 450 disagree with 
the segment boundary at i = 465. The reason is 
that lexical cohesion can not cover all aspect of 
coherence of a segment; an incoherent piece of 
text can be lexically cohesive. 

CONCLUSION 

This paper proposed LCP, an indicator of seg- 
ment changing, which concentrates on lexical 
cohesion of a text segment. The experiment 
proved that LCP closely correlate with the seg- 
ment boundaries captured by the human judg- 
ments, and that lexical cohesion plays main role 
in forming a sequence of words into segments. 



Text segmentation described here provides 
basic information for text understanding: 

• Resolving anaphora and ellipsis: 
Segment boundaries provide valuable re- 
striction for determination of the referents. 

• Analyzing text structure: 

Segment boundaries can be considered as 
segment switching (push and pop) in hier- 
archical structure of text. 

The segmentation can be applied also to text 
summarizing. (Consider a list of average meaning 
of segments.) 

In future research, the author needs to ex- 
amine validity of LCP for other genres — Hearst 
(1993) segments expository texts. Incorporating 
other clues (e.g. cue phrases, tense and aspect, 
etc.) is also needed to make this segmentation 
method more robust. 
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Figure 4. Correlation between LCP and segment boundaries. 



